Abstract

In this article, we have proposed several approaches for post processing 
a large ensemble of regression trees or conjunctive rules. These techniques 
are applied to estimation of quantitative traits from markers, to the bench- 
mark "Boston Housing" data set and to some simulated data. The results 
from these experiments show that the methods we have considered here 
are promising. In most cases, the models constructed by post processing 
the learners with partial least squares regression had better prediction 
performance than, for example, the ones produced by the random forest 
or the rulefit algorithms which use equal weights or weights estimated 
from lasso regression. 

1 Review of Ensemble Methods

Ensemble learning ([H], [5], [13]) provides solutions to complex statistical pre- 
diction problems by simultaneously using a number of models. By bounding 
false idealizations, focusing on regularities and stable common behavior, ensem- 
ble modeling approaches provide solutions that as a whole outperform the single 
models. 

Early developments in ensemble learning include Breiman's bagging
(bootstrap aggregating) ([1]) and random forest ([3]), and Freund and
Schapire's AdaBoost ([5]). These methods involve "randomly" sampling the
"space of models" to produce an ensemble of base learners, followed by a
"post-processing" of these learners to construct a final prediction model.

In this article, we review several different approaches for ensemble post- 
processing and propose some new ones. The main point of this article is that 
the base learners in an ensemble can be used as an input to any regression 
model. The choices of different models herein are based on the experience and
preferences of the authors.

In the remainder of this section, we will review the recently proposed
importance sampled learning ensembles (ISLE) framework [7] for ensemble
model generation. The rule ensembles are also reviewed herein. In Section 2,
we propose new ensemble post processing methods including partial least
squares regression, multivariate kernel smoothing and use of out-of-bag
observations. Section 3 is reserved for examples and simulations by which we
compare the methods proposed here with the existing ones. Some remarks about
hyperparameter choice and directions for future research are provided in
Section 4.

1.1 ISLE Approach 

Given a learning task and a relevant data set, we can generate a set of models
from a predetermined model family. Bagging bootstraps the training data set
[1] and produces a model for each bootstrap sample. Random forest ([TTJ H])
creates a diverse set of models by randomly selecting a few aspects of the data
set while generating each model. AdaBoost [5] and ARCing [3] iteratively build
models by varying case weights (up-weighting cases with large current errors
and down-weighting those accurately estimated) and employ the weighted sum
of the estimates of the sequence of models. There have been few attempts to
unify these ensemble learning methods. One such framework is ISLE, due
to Friedman & Popescu [7].

We are to produce a regression model to predict the continuous outcome
variable y from a p-vector of input variables x. We will generate models from
a given model family $\mathcal{F} = \{f(x, \theta) : \theta \in \Theta\}$
indexed by the parameter $\theta$. The final ensemble models considered by the
ISLE framework have an additive form:

$$F(x) = w_0 + \sum_{j=1}^{M} w_j f(x, \theta_j) \qquad (1)$$

where $\{f(x, \theta_j)\}_{j=1}^{M}$ are base learners selected from
$\mathcal{F}$. ISLE uses a two-step approach to produce $F(x)$. The first step
involves sampling the space of possible models to obtain
$\{\hat{\theta}_j\}_{j=1}^{M}$. The second step proceeds with combining the
base learners by choosing the weights $\{w_j\}_{j=0}^{M}$ in (1).

The pseudo code to produce M models $\{f(x, \theta_j)\}_{j=1}^{M}$ under the
ISLE framework is given below:

Algorithm 1.1: ISLE($M$, $\nu$, $\eta$)

    $F_0(x) = 0$
    for $j = 1$ to $M$
        $(c_j, \theta_j) = \arg\min_{(c, \theta)} \sum_{i \in S_j(\eta)} L(y_i, F_{j-1}(x_i) + c f(x_i, \theta))$
        $T_j(x) = f(x, \theta_j)$
        $F_j(x) = F_{j-1}(x) + \nu c_j T_j(x)$
    return $\{T_j(x)\}_{j=1}^{M}$ and $F_M(x)$

Here $L(\cdot, \cdot)$ is a loss function, $S_j(\eta)$ is a subset of the
indices $\{1, 2, \ldots, n\}$ chosen by a sampling scheme $\eta$, and
$0 \le \nu \le 1$ is a memory parameter.
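
The loop of Algorithm 1.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the loss is squared error, the model family is regression stumps (one split), and the sampling scheme draws a fraction `eta` of the rows without replacement. With stumps and squared error the multiplier c is absorbed into the leaf values, so the sketch uses c_j = 1. The names `fit_stump` and `isle` are hypothetical.

```python
import random


def fit_stump(X, resid, idx):
    """Least-squares regression stump fitted to `resid` on the rows in `idx`."""
    best = None
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in idx})
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2.0  # candidate split point
            left = [resid[i] for i in idx if X[i][j] <= s]
            right = [resid[i] for i in idx if X[i][j] > s]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, s, ml, mr)
    if best is None:  # no split possible: fall back to a constant
        m = sum(resid[i] for i in idx) / len(idx)
        return lambda x: m
    _, j, s, ml, mr = best
    return lambda x: ml if x[j] <= s else mr


def isle(X, y, M=50, nu=0.1, eta=0.5, seed=0):
    """Generate M base learners T_j and the memory function F_M on the training rows."""
    rng = random.Random(seed)
    n = len(y)
    F = [0.0] * n                       # F_0(x) = 0
    learners = []
    for _ in range(M):
        idx = rng.sample(range(n), max(2, int(eta * n)))   # S_j(eta)
        resid = [y[i] - F[i] for i in range(n)]            # squared error loss
        T = fit_stump(X, resid, idx)                       # T_j(x) = f(x, theta_j)
        learners.append(T)
        F = [F[i] + nu * T(X[i]) for i in range(n)]        # F_j = F_{j-1} + nu * T_j
    return learners, F
```

Setting `nu=0` in this sketch makes every stump a fit to y itself on a random subsample, which corresponds to the bagging-like end of the framework; a positive `nu` makes later learners depend on earlier ones, as in boosting.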

The classic ensemble methods of Bagging, Random Forest, AdaBoost, and
Gradient Boosting are special cases of the ISLE ensemble model generation
procedure |TS]. In Bagging and Random Forests the weights in (1) are set to
predetermined values, i.e. $w_0 = 0$ and $w_j = 1/M$ for
$j = 1, 2, \ldots, M$. Boosting calculates these weights in a sequential
fashion at each step by having positive memory $\nu$, estimating $c_j$, and
taking $F_M(x)$ as the final prediction model.
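
As a small worked illustration of the equal-weight case, assume $\nu = 0$ in Algorithm 1.1, so that $F_j(x) = F_{j-1}(x) = 0$ and every base learner is fitted independently of the others. With the equal weights $w_0 = 0$ and $w_j = 1/M$, the additive model (1) reduces to the plain average of the base learners:

```latex
F(x) = w_0 + \sum_{j=1}^{M} w_j T_j(x)
     = 0 + \sum_{j=1}^{M} \frac{1}{M} T_j(x)
     = \frac{1}{M} \sum_{j=1}^{M} T_j(x).
```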

Friedman & Popescu [7] recommend learning the weights $\{w_j\}_{j=0}^{M}$
using lasso [17]. Let $T = (T_j(x_i))_{i=1, j=1}^{n, M}$ be the $n \times M$
matrix of predictions for the n observations by the M models in an ensemble.
The weights $(w_0, w = \{w_m\}_{m=1}^{M})$ are obtained from

$$w = \arg\min_{w} \, (y - w_0 1_n - Tw)'(y - w_0 1_n - Tw) + \lambda \sum_{m=1}^{M} |w_m|. \qquad (2)$$

Here $\lambda > 0$ is the shrinkage parameter; larger values of $\lambda$
decrease the number of models included in the final prediction model. The
final ensemble model is given by

$$F(x) = w_0 + \sum_{m=1}^{M} w_m T_m(x). \qquad (3)$$
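
The lasso post-processing step can be illustrated with a small coordinate descent routine. This is a sketch under assumptions, not the solver used by the authors: it minimises the penalised sum of squares in (2) with an unpenalised intercept, runs a fixed number of sweeps instead of testing convergence, and the names `soft_threshold` and `lasso_weights` are hypothetical.

```python
def soft_threshold(z, g):
    """Soft-thresholding operator: sign(z) * max(|z| - g, 0)."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_weights(T, y, lam, n_sweeps=200):
    """Minimise (y - w0 - T w)'(y - w0 - T w) + lam * sum_m |w_m|.

    T is a list of n rows, each holding the M base learner predictions.
    Returns the intercept w0 and the list of weights w.
    """
    n, M = len(T), len(T[0])
    w0, w = sum(y) / n, [0.0] * M
    resid = [yi - w0 for yi in y]            # current residual y - w0 - T w
    for _ in range(n_sweeps):
        shift = sum(resid) / n               # unpenalised intercept update
        w0 += shift
        resid = [r - shift for r in resid]
        for m in range(M):
            col = [row[m] for row in T]
            z = sum(t * t for t in col)
            if z == 0.0:
                continue
            # inner product of column m with the partial residual
            rho = sum(t * (r + t * w[m]) for t, r in zip(col, resid))
            new = soft_threshold(rho, lam / 2.0) / z
            resid = [r - t * (new - w[m]) for t, r in zip(col, resid)]
            w[m] = new
    return w0, w
```

Base learners whose weight is driven exactly to zero drop out of the final model (3), which is the sparsity effect of a larger shrinkage parameter described above.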

1.2 Rule Ensembles 

The ensemble methods in the preceding sections can be used with any
regression model as the base learner; usually, however, the base learners are
regression trees. Each decision tree in the ensemble partitions the input
space using the product of indicator functions of "simple" regions based on
several input variables. A tree with K terminal nodes defines a K-partition
of the input space where the membership to a specific node, say node k, can
be determined by applying the conjunctive rule

$$r_k(x) = \prod_{l=1}^{p} I(x_l \in s_{lk})$$

where $I(\cdot)$ is the indicator function and $x = (x_1, x_2, \ldots, x_p)$
are the input variables. The regions $s_{lk}$ are intervals for a continuous
variable and subsets of the possible values for a categorical variable.
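
To make the conjunctive rule concrete, the following sketch evaluates a rule on an observation. The encoding is an assumption made for illustration only: a rule is a dictionary from variable index to region, where a tuple `(low, high)` stands for the interval (low, high] of a continuous variable and a set lists the allowed levels of a categorical variable. The names `rule_value` and `rule_matrix` are hypothetical.

```python
def rule_value(x, regions):
    """Evaluate r_k(x) = prod_l I(x_l in s_lk) for one observation x.

    `regions` maps a variable index l to its region s_lk: a (low, high]
    tuple for a continuous variable or a set of levels for a categorical
    one. Variables missing from `regions` are unconstrained.
    """
    for l, region in regions.items():
        if isinstance(region, tuple):
            low, high = region
            if not (low < x[l] <= high):
                return 0
        elif x[l] not in region:
            return 0
    return 1


def rule_matrix(X, rules):
    """Build the n x K 0/1 matrix with entries r_k(x_i)."""
    return [[rule_value(x, rule) for rule in rules] for x in X]
```

For example, `{0: (0.5, float("inf")), 2: {"a", "b"}}` encodes the rule "variable 0 exceeds 0.5 and variable 2 is a or b"; each root-to-node path of a tree yields one such dictionary.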

Given a set of decision trees, rules can be extracted from these trees to
define a collection of rules. Let $R = (r_k(x_i))_{i=1, k=1}^{n, K}$ be the
$n \times K$ matrix of rules for the n observations by the K rules in the
ensemble. The rulefit algorithm of Friedman & Popescu [8] uses the weights
$(w_0, w = \{w_k\}_{k=1}^{K})$ that are estimated from

$$w = \arg\min_{w} \, (y - w_0 1_n - Rw)'(y - w_0 1_n - Rw) + \lambda \sum_{k=1}^{K} |w_k| \qquad (4)$$

in the final prediction model

$$F(x) = w_0 + \sum_{k=1}^{K} w_k r_k(x). \qquad (5)$$

2 Post Processing Ensembles Revisited 

We can use the base learners in an ensemble as input variables in any
regression method. Since the number of models in an ensemble can easily exceed
the number of observations, we prefer regression methods that can handle high
dimensional input. A few such approaches, namely principal components, partial
least squares regression, multivariate kernel smoothing and weighting, are
illustrated in this section. We will compare these approaches to the existing
standards, random forest and rulefit, in the next section.

2.1 Principal Components and Partial Least Squares Regression

The models in an ensemble are all aligned with the response variable and
therefore we should expect that they are correlated with each other. Principal
component regression (PCR) and partial least squares regression (PLSR) are two
techniques which are suitable for high dimensional regression problems where
the predictor variables exhibit multicollinearity.

PCR and PLSR decompose the input matrix X (here, the matrix of base learner
outputs) into orthogonal scores T and loadings P

$$X = TP$$

and regress Y on the first few columns of the scores T using ordinary least
squares. This leads to biased but low variance estimates of the regression
coefficients in model (1). PCR constructs the decomposition from X alone,
whereas PLSR incorporates information on both X and Y in the loadings.

Both of these methods behave as shrinkage methods [5] where the amount of
shrinkage is controlled by the number of loadings included. An obvious question
is how many loadings are needed to obtain the best generalization for the
prediction of new observations. This number is, in general, chosen by
resampling techniques such as cross-validation or bootstrapping.

The illustrations in the following section demonstrate the good performance
of PLSR for post processing trees or rules. PLSR, as opposed to lasso, achieves
shrinkage without forcing sparsity on the input variables. The ensemble learners
are all "directed" towards the output variable and therefore they exhibit strong
multicollinearity. This is a case where we would expect PLSR to work better
than lasso.
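
A compact way to see how PLSR turns base learner outputs into weights is the single-response NIPALS recursion sketched below. It is an illustrative sketch, not the authors' code: it assumes a single response, centres the inputs and the response, and returns an intercept and one weight per base learner so that the fitted model has the additive form (1). The helper names `dot` and `pls1_weights` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def pls1_weights(T, y, n_components):
    """PLS regression of y on the columns of T; returns (w0, w)."""
    n, M = len(T), len(T[0])
    col_mean = [sum(row[m] for row in T) / n for m in range(M)]
    y_mean = sum(y) / n
    X = [[row[m] - col_mean[m] for m in range(M)] for row in T]  # centred inputs
    r = [yi - y_mean for yi in y]                                # centred response
    w = [0.0] * M
    rotations, loadings = [], []
    for _ in range(n_components):
        a = [dot([X[i][m] for i in range(n)], r) for m in range(M)]  # X'r
        norm = dot(a, a) ** 0.5
        if norm < 1e-12:          # nothing left to explain
            break
        a = [v / norm for v in a]
        s = [dot(row, a) for row in X]                     # score vector
        ss = dot(s, s)
        p = [dot([X[i][m] for i in range(n)], s) / ss for m in range(M)]  # X loading
        c = dot(r, s) / ss                                 # response loading
        # rewrite the score as a combination of the original centred inputs
        rot = a[:]
        for p_b, rot_b in zip(loadings, rotations):
            coef = dot(p_b, a)
            rot = [v - coef * u for v, u in zip(rot, rot_b)]
        rotations.append(rot)
        loadings.append(p)
        w = [wm + c * v for wm, v in zip(w, rot)]
        # deflate inputs and response before extracting the next component
        X = [[X[i][m] - s[i] * p[m] for m in range(M)] for i in range(n)]
        r = [ri - c * si for ri, si in zip(r, s)]
    w0 = y_mean - dot(col_mean, w)
    return w0, w
```

With few components the weights are heavily shrunk; as the number of components approaches the rank of T the fit approaches ordinary least squares, which is why the number of components is the quantity tuned by resampling.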

The coefficients of the tree ensemble model in (3) or the rule ensemble model
in (5) can be used to evaluate importances of trees, rules and individual input
variables [8]. For the tree ensembles the importance of the kth tree is
evaluated as

$$I_k = |w_k| \, \mathrm{std}(T_k)$$

where $\mathrm{std}(T_k)$ denotes the standard deviation of the output of the
kth tree over the individuals in the training sample. For the rule ensembles
the importance of the kth rule is calculated similarly as

$$I_k = |w_k| \sqrt{s_k (1 - s_k)}$$

where $s_k = \frac{1}{n} \sum_{i=1}^{n} r_k(x_i)$ is the support of rule k.
The individual variable importances are calculated from the sum of the
importances of the trees or rules which contain that variable.

The PLSR model is in the same additive form as in (1); therefore the weights
$w_1, w_2, \ldots, w_M$ in the model can be used to calculate tree, rule or
variable importances the same way they were calculated for the lasso post
processing approach.
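
The importance calculations for a rule ensemble can be sketched as follows. The sketch assumes that the support of a rule is the fraction of training observations that satisfy it, and that the variable importance is the plain sum over the rules containing the variable, as described in the text. The function names and the `rule_vars` argument (the set of variable indices used by each rule) are hypothetical.

```python
from math import sqrt


def rule_importances(R, w):
    """I_k = |w_k| * sqrt(s_k * (1 - s_k)), with s_k the mean of column k of R."""
    n = len(R)
    support = [sum(row[k] for row in R) / n for k in range(len(w))]
    return [abs(wk) * sqrt(s * (1.0 - s)) for wk, s in zip(w, support)]


def variable_importances(importances, rule_vars, p):
    """Sum the importances of all rules that contain each of the p variables."""
    total = [0.0] * p
    for imp, used in zip(importances, rule_vars):
        for l in used:
            total[l] += imp
    return total
```

The same two functions apply unchanged to weights obtained from PLSR, since only the weight vector and the rule matrix enter the calculation.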

2.2 Multivariate Kernel Smoothing 

We will concentrate on kernel smoothing using the Nadaraya-Watson estimator.
For a detailed presentation of the subject, we refer the reader to (jB]). The
Nadaraya-Watson estimator is a weighted sum of the observed responses y. Let
the values of the base learners at an input point x be written in an M
dimensional vector t(x). The final prediction model at input point x can be
obtained as

$$F(x) = \frac{\sum_{i=1}^{n} K_h(t(x_i) - t(x)) \, y_i}{\sum_{i=1}^{n} K_h(t(x_i) - t(x))}.$$

The kernel function $K_h(\cdot)$ is a symmetric function that integrates to
one, and $h > 0$ is the smoothing parameter. In practice, the kernel function
and the smoothing parameter are usually selected using the cross-validated or
bootstrap performances for a range of kernel functions and smoothing parameter
values.
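
A minimal sketch of the Nadaraya-Watson estimator on base learner outputs is given below, assuming a Gaussian kernel. The normalising constant of the kernel is omitted because it cancels in the ratio. The function names and the fallback to the training mean when all kernel values underflow to zero are choices made for this illustration.

```python
from math import exp


def gaussian_kernel(u, h):
    """Unnormalised Gaussian kernel of the vector u with bandwidth h."""
    return exp(-sum(v * v for v in u) / (2.0 * h * h))


def nw_predict(t_train, y_train, t_new, h):
    """Kernel-weighted average of y_train around the point t_new.

    t_train holds the M-dimensional base learner outputs t(x_i) of the
    training observations and t_new holds t(x) for the new input.
    """
    k = [gaussian_kernel([a - b for a, b in zip(t_i, t_new)], h)
         for t_i in t_train]
    total = sum(k)
    if total == 0.0:   # every kernel value underflowed: use the plain mean
        return sum(y_train) / len(y_train)
    return sum(ki * yi for ki, yi in zip(k, y_train)) / total
```

A small bandwidth makes the prediction follow the responses of the training points whose base learner outputs are closest to t(x); a large bandwidth pulls the prediction toward the overall mean of y.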

2.3 Weighting Ensembles using Out-of-Bag Observations 

As we have mentioned earlier, most of the important early ensemble methods
combine the base models using weights. Both bagging and random forest
algorithms use equal weighting. Estimating w by minimizing

$$(y - Tw)'(y - Tw)$$

subject to the constraint $w \ge 0$ gives the Stacking approach of Wolpert [T^
and Breiman [2]. In stacking, the final prediction model is given by

$$F(x) = T(x) w.$$

The ensemble generation algorithms based on bootstrapping the observations
build the base learners from the observations in the bootstrap sample and
leave us with the out-of-bag observations to evaluate the generalization
performance of that particular learner. The following weighting scheme will
down-weight the base learners which have bad generalization performance. Let
$(y_{oob_i}, x_{oob_i})$ denote the ith out-of-bag observation for
$i = 1, 2, \ldots, n_{oob}$. We have M base learners $\{T_l(x)\}_{l=1}^{M}$.
We can use

$$F(x) = \frac{\sum_{l=1}^{M} \sum_{i=1}^{n_{oob}} K_h(y_{oob_i} - T_l(x_{oob_i})) \, T_l(x)}{\sum_{l=1}^{M} \sum_{i=1}^{n_{oob}} K_h(y_{oob_i} - T_l(x_{oob_i}))}$$

as the prediction of the response at input value x. This involves keeping
track of the out-of-bag performance of each model in the ensemble and using
the weights

$$w_l = \frac{\sum_{i=1}^{n_{oob}} K_h(y_{oob_i} - T_l(x_{oob_i}))}{\sum_{l'=1}^{M} \sum_{i=1}^{n_{oob}} K_h(y_{oob_i} - T_{l'}(x_{oob_i}))}, \qquad l = 1, 2, \ldots, M.$$

The value of h controls the smoothness of the model. For large values of
this parameter the kernel method will assign approximately equal weights to
the learners $T_l$, $l = 1, 2, \ldots, M$, and hence it is equivalent to
random forest weighting. Smaller values of the parameter assign higher weights
to the models with small out-of-bag errors. It is customary to choose the h
that minimizes the cross-validated or bootstrapped errors. In addition, it is
sometimes beneficial to eliminate the models with the lowest weights from the
final ensemble.
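
The out-of-bag weighting scheme can be sketched as below, assuming a Gaussian kernel and assuming that the out-of-bag errors of each learner have already been collected. The function names, the `oob_errors` layout (one list of errors per learner) and the equal-weight fallback for the degenerate case are hypothetical choices for this illustration.

```python
from math import exp


def oob_weights(oob_errors, h):
    """Weight each learner by the summed kernel value of its out-of-bag errors.

    oob_errors[l] lists y_oob_i - T_l(x_oob_i) for learner l.
    """
    scores = [sum(exp(-e * e / (2.0 * h * h)) for e in errs)
              for errs in oob_errors]
    total = sum(scores)
    if total == 0.0:   # all kernel values underflowed: use equal weights
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def oob_predict(weights, learner_outputs):
    """Weighted sum of the base learner outputs T_l(x) at one input x."""
    return sum(w * t for w, t in zip(weights, learner_outputs))
```

With a very large bandwidth every kernel value is close to one, so the weights approach 1/M and the prediction approaches the equally weighted average; a small bandwidth concentrates the weight on learners with small out-of-bag errors.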

3 Illustrations 

The following ensemble models are compared in this section: 

1. r(pslr): Partial Least Squares Regression with Rules, 

2. t(pslr): Partial Least Squares Regression with Trees, 

3. r(lasso): lasso with Rules, 

4. t(lasso): lasso with Trees, 

5. w(oob): Weighting Using Out-of-Bag performance, 

6. wt(oob): Weighting Using Out-of-Bag performance (best 60% of the trees), 

7. rf: Random Forest, 

8. ksr: Kernel Smoothing with Rules, 

9. kst: Kernel Smoothing with Trees. 

In all these models, the hyperparameters are set using 10-fold
cross-validation in the training sample.

Our first example involves the Fusarium head blight (FHB) data set that 
is available from the author upon request. A very detailed explanation of this 
data set is given in [TB] . 

Example 3.1. FHB is a plant disease caused by the fungus Fusarium
graminearum and results in tremendous losses by reducing grain yield and
quality. In addition to the decrease in grain yield and quality, FHB also
causes damage by contaminating the crop with mycotoxins. Therefore, breeding
for improved FHB resistance is an important breeding goal. Our aim is to build
a prediction model for FHB resistance in barley based on available genetic
variables. The FHB data set included FHB measurements along with 2251 single
nucleotide polymorphisms (SNP) on 622 elite North American barley lines. The
10-fold cross-validated accuracies, measured by the correlations of the true
responses with the predicted values, are displayed in Figure 1.

Example 3.2. In our second example we repeat the following experiment 100
times. Elements of the 150 x 100 input matrix X are independently generated
from a uniform(0, 1) distribution. The elements of the coefficient vector
$\beta$ were also generated independently from uniform(0, 1), and 85% of these
were selected randomly and set to zero. The 150 dimensional response vector y
was generated according to $y = X\beta + e$ where e was generated from
$N_{150}(0, 0.3 I_{150})$ so that the signal-to-noise ratio was about 2 to 1.
The data were separated into training data and test data in the ratio of 2 to
1. The box plots in Figure 2 compare the different approaches to ensemble post
processing in terms of the accuracies in the test data set.
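
One replication of this simulation design can be generated with the short sketch below. It follows the description in the example (uniform inputs and coefficients, 85% of the coefficients set to zero, Gaussian noise with variance 0.3, a 2 to 1 split), but the function name, the seeding and the random split are assumptions made for illustration.

```python
import random
from math import sqrt


def simulate_example(n=150, p=100, zero_frac=0.85, noise_var=0.3, seed=0):
    """Generate one replication: inputs, response, coefficients and a 2:1 split."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    beta = [rng.random() for _ in range(p)]
    # set a randomly chosen 85% of the coefficients to zero
    for j in rng.sample(range(p), int(round(zero_frac * p))):
        beta[j] = 0.0
    sd = sqrt(noise_var)
    y = [sum(b * v for b, v in zip(beta, row)) + rng.gauss(0.0, sd)
         for row in X]
    order = list(range(n))
    rng.shuffle(order)
    n_train = (2 * n) // 3            # training : test = 2 : 1
    return X, y, beta, order[:n_train], order[n_train:]
```

Repeating the call with 100 different seeds and recording the test-set correlation of each method would give the kind of box plot comparison described in the example.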

Example 3.3. In this example, we repeat the experiment in Friedman & Popescu
([8]). Elements of the 1000 x 100 input matrix are independently generated
from the uniform(0, 1) distribution. The 1000 dimensional response vector y
was generated according to

$$y_i = 10 \prod_{j=1}^{5} e^{-2 x_{ij}^2} + \sum_{j=6}^{35} x_{ij} + e_i$$

where $e_i$ was generated from $N(0, \sigma^2 = 1)$. The data were separated
into training data and test data in the ratio of 2 to 1. The box plots in
Figure 3 compare the test data performances of the different approaches over
100 replications of the experiment.

Example 3.4. In order to compare the performance of prediction models we use
the benchmark data set "Boston Housing" ([10]). This data set includes n = 506
observations and p = 14 variables. The response variable is the median house
value, which is predicted from the remaining 13 variables in the data set. The
10-fold cross-validated accuracies are displayed by the box plots in Figure 4.
The PLSR approach has the best cross-validated prediction performance.

4 Conclusion 

In this article, we have proposed several approaches for post processing a large 
ensemble of prediction models or rules. The approach taken here is to treat the 
ensemble models or the rules as base learners and use them as input variables 
in the regression problem. Some weighting approaches to ensemble models are 
also considered. 

The results from our simulations and benchmark experiments show that
these post processing methods are promising. In most cases, the proposed
models had better prediction performances than the ones given by the popular
random forest or the rulefit algorithms. PLSR with rules uniformly produced
the models with best prediction performances. The ensembles based on rules
extracted from trees, in general, had better performances.

The complexity of trees or rules in the ensemble increases with the number of
nodes from the root to the final node (depth). The maximum depth is an
important parameter since it controls the degree of interactions between the
input variables incorporated by the ensemble model, and its value should be
set carefully. It might also be useful to use some degree of cost pruning
while generating the trees by the ISLE algorithm.

One last remark: This article argues that individual trees or rules should be
treated as input variables to the statistical learning problem. It is almost
always possible to incorporate other input variables, like the original
variables or their functions, into our prediction model. The rulefit algorithm
of Friedman & Popescu optionally includes the input variables along with the
rules in an additive model and uses lasso regression to estimate the
coefficients in the model. Integrating additional input variables into the
final ensemble is also straightforward with PLSR and kernel smoothing.

Figure 1: 10-fold cross-validated accuracies measured by correlation for the
FHB data set. The ensemble of rules with PLSR has slightly higher accuracy
compared to its alternative, rules with lasso. The number of trees was set to
200. The maximum depth allowed for each tree or rule was set to 5.

Figure 2: The box plots compare the different approaches to ensemble post
processing for the scenario in Example 3.2. The number of trees generated was
200, and the maximum depth parameter was set to 2.

Figure 3: The box plots compare the different approaches to ensemble post
processing for the scenario in Example 3.3. The number of trees was set to
200, and the maximum depth parameter was set to 2.

Figure 4: 10-fold cross-validated accuracies for the "Boston Housing" data are
displayed by the box plots. The PLSR approach has the best cross-validated
prediction performance. We have generated 300 trees by the ISLE approach, and
the maximum depth parameter was set to 4. For the methods that use a kernel
function we have uniformly used the Gaussian kernel. The sparsity parameters
of the lasso or PLSR and the kernel width parameters were obtained by
minimizing 10-fold cross-validated errors in the training data.